Welcome everybody to the next part of deep learning. Today we want to finish talking
about common practices and in particular we want to have a look at the evaluation.
Machine learning is the science of sloppiness really.
So of course we need to evaluate the performance of the models that we have trained. So far, we have worked with the training set and estimated the hyperparameters and all of the model parameters on it, and now we want to evaluate the generalization performance on previously unseen data.
This means the test data, and it's time to open the vault.
Remember: "Of all things the measure is man" [8].
Humans are a low bar to exceed.
So data is annotated and labeled by humans, and during training all labels are assumed
to be correct, but of course to err is human.
All input is potentially erroneous, which means that in addition we may have biased data.
The ideal situation that you actually want to have for your data is that it has been
annotated by multiple human raters; then you can take the mean or a majority vote.
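As a small illustration, here is a minimal sketch in Python, with entirely hypothetical ratings, of how a majority vote or a mean label could be derived from several raters:

```python
import numpy as np

# Hypothetical example: three raters label five samples (classes 0-2).
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 2],
    [0, 2, 0],
    [1, 1, 0],
])

# Majority vote per sample (ties resolved by the lowest class index here).
majority = np.array([np.bincount(r, minlength=3).argmax() for r in ratings])

# For ordinal or continuous annotations (e.g. a rating scale), the mean is an option.
mean_label = ratings.mean(axis=1)

print(majority)     # [0 1 2 0 1]
print(mean_label)
```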
There is also a very nice paper by Stefan Steidl from 2005 [8] that introduces an entropy-based measure which takes into account the confusions of human reference labelers.
This is very useful in situations where you have unclear labels; in emotion recognition in particular this is a problem. Humans sometimes confuse classes like angry versus annoyed, while they are not very likely to confuse angry versus happy.
That is a very clear distinction, but of course there are different degrees of happiness: sometimes you are just a little bit happy, and then it becomes really difficult to differentiate happy from neutral, which is also hard for humans.
So for prototypical emotions, for example when they are played by actors, you get emotion recognition rates well over 90 percent, but with real data, with emotions as they occur in daily life, they are much harder to predict.
This can also be seen in the labels and in the distribution of the labels.
If you have a prototype, all of the raters will agree that it is clearly this particular class.
If you have nuanced, less clear emotions, you will see that the raters produce a more or less uniform distribution over the labels, because they cannot reliably assess the specific sample either.
So mistakes by the classifier are obviously less severe if humans confuse the same classes, and this is what the entropy-based measure takes into account.
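The exact formulation of the measure is given in [8]; the following Python sketch only illustrates the underlying idea with hypothetical label distributions, namely that the per-sample entropy of the human labels indicates ambiguity, and that an error on a class which no human chose is the more severe one. It is not the paper's actual formula.

```python
import numpy as np

def label_entropy(dist):
    """Shannon entropy of a per-sample human label distribution."""
    p = dist[dist > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical reference distributions from several human labelers (4 classes):
# a clear prototype vs. an ambiguous sample.
human = np.array([
    [1.0, 0.0, 0.0, 0.0],   # all raters agree -> entropy 0
    [0.4, 0.4, 0.1, 0.1],   # raters disagree  -> high entropy
])
predicted = np.array([1, 1])          # classifier output (class index)
reference = human.argmax(axis=1)      # majority reference label

for h, y_hat, y in zip(human, predicted, reference):
    # An error is considered severe if the predicted class was never chosen by humans.
    severe = (y_hat != y) and (h[y_hat] == 0.0)
    print(f"entropy={label_entropy(h):.2f}, predicted={y_hat}, "
          f"reference={y}, severe_error={severe}")
```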
Now, if we look at performance measures, you want to use the typical classification measures, and they are built around the true positives, false positives, true negatives, and false negatives.
From these, for binary classification problems, you can then compute true and false positive rates.
This typically leads to numbers like the accuracy, which is the number of true positives plus true negatives over the total number of positives and negatives.
Then there is the precision, or positive predictive value, which is computed as the number of true positives over the number of true positives plus false positives.
The so-called recall, or true positive rate, is defined as the true positives over the true positives plus the false negatives.
The specificity, or true negative rate, is given as the true negatives over the true negatives plus the false positives.
Finally, the F1 score combines precision and recall: it is two times precision times recall divided by the sum of precision and recall, i.e. their harmonic mean.
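All of these measures follow directly from the four counts; a minimal sketch with hypothetical labels (division-by-zero guards omitted for brevity):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute the basic counts and the derived measures for a binary problem."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)                      # positive predictive value
    recall      = tp / (tp + fn)                      # sensitivity / true positive rate
    specificity = tn / (tn + fp)                      # true negative rate
    f1          = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
print(binary_metrics(y_true, y_pred))   # all 0.75 for this toy example
```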
I typically recommend receiver operating characteristic (ROC) curves, because all of the measures you have seen above depend on a decision threshold.
With ROC curves, you essentially evaluate your classifier at all possible thresholds, plotting the true positive rate against the false positive rate.
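A minimal sketch of how such a curve can be computed by sweeping the decision threshold over hypothetical classifier scores (assuming no tied scores, and using the trapezoidal rule for the area under the curve):

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep over all decision thresholds and return (FPR, TPR) pairs."""
    order = np.argsort(-scores)                       # sort by descending score
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()            # true positive rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # false positive rate
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

# Hypothetical classifier scores and ground-truth labels.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0  ])

fpr, tpr = roc_curve(scores, labels)
auc = np.trapz(tpr, fpr)   # area under the ROC curve
print(fpr, tpr, auc)
```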
Deep Learning - Common Practices Part 4
This video discusses how to evaluate deep learning approaches.
Video References:
Lex Fridman's Channel
Further Reading:
A gentle Introduction to Deep Learning
References:
[1] M. Aubreville, M. Krappmann, C. Bertram, et al. “A Guided Spatial Transformer Network for Histology Cell Differentiation”. In: ArXiv e-prints (July 2017). arXiv: 1707.08525 [cs.CV].
[2] James Bergstra and Yoshua Bengio. “Random Search for Hyper-parameter Optimization”. In: J. Mach. Learn. Res. 13 (Feb. 2012), pp. 281–305.
[3] Jean Dickinson Gibbons and Subhabrata Chakraborti. “Nonparametric statistical inference”. In: International encyclopedia of statistical science. Springer, 2011, pp. 977–979.
[4] Yoshua Bengio. “Practical recommendations for gradient-based training of deep architectures”. In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–478.
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
[6] Boris T Polyak and Anatoli B Juditsky. “Acceleration of stochastic approximation by averaging”. In: SIAM Journal on Control and Optimization 30.4 (1992), pp. 838–855.
[7] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. “Searching for Activation Functions”. In: CoRR abs/1710.05941 (2017). arXiv: 1710.05941.
[8] Stefan Steidl, Michael Levit, Anton Batliner, et al. “Of All Things the Measure is Man: Automatic Classification of Emotions and Inter-labeler Consistency”. In: Proc. of ICASSP. IEEE - Institute of Electrical and Electronics Engineers, Mar. 2005.